Pathways Explorer

Authors

Affiliations

Alejandra Escobar

MGnify team at EMBL-EBI

Amartya Nambiar

Google Summer of Code 2023

This is a static preview

You can run and edit these examples interactively on Galaxy

Pathways Explorer

In this notebook we demonstrate how the MGnifyR tool can be used to fetch functional annotation results generated through the MGnify metagenomic analyisis pipelines. Then we show how to generate a pathways visualization using Pathview in R.

MGnifyR is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the MGnify API. One benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter

For this notebook, a lot of the data-wrangling code has been put in a kegg_pathways_utils.r library file.

# Loading libraries:
suppressMessages({
    library(ALDEx2)
    library(data.table)
    library(dplyr)
    library(IRdisplay)
    library(KEGGREST)
    library(MGnifyR)   
    library(pathview)
    library(tidyjson)
    library(IRdisplay)
})
source("lib/variable_utils.r")
source("lib/kegg_pathways_utils.r")

display_markdown(file = '../_resources/mgnifyr_help.md')

Help with MGnifyR

MGnifyR is an R package that provides a convenient way for R users to access data from the MGnify API.

Detailed help for each function is available in R using the standard ?function_name command (i.e. typing ?mgnify_query will bring up built-in help for the mgnify_query command).

A vignette is available containing a reasonably verbose overview of the main functionality. This can be read either within R with the vignette("MGnifyR") command, or in the development repository

MGnifyR Command cheat sheet

The following list of key functions should give a starting point for finding relevent documentation.

mgnify_client() : Create the client object required for all other functions.
mgnify_query() : Search the whole MGnify database.
mgnify_analyses_from_xxx() : Convert xxx accessions to analyses accessions. xxx is either samples or studies.
mgnify_get_analyses_metadata() : Retrieve all study, sample and analysis metadata for given analyses.
mgnify_get_analyses_phyloseq() : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
mgnify_get_analyses_results() : Get functional annotation results for a set of analyses.
mgnify_download() : Download raw results files from MGnify.
mgnify_retrieve_json() : Low level API access helper function.

# Create your session mgnify_client object
mg = mgnify_client(usecache = T, cache_dir = '/home/jovyan/.mgnify_cache')

# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=500)
options(repr.plot.width=4, repr.plot.height=4)

Draw presence/absence KOs for one metagenomic sample

1.1. Fetch data from MGnify & Pathways Selection

PATHWAY_STUDY_IDS <- get_variable_from_link_or_input('PATHWAY_STUDY_IDS', name =  'Study Accession', default = 'MGYS00006180,MGYS00006178')
PATHWAY_STUDY_IDS <- c(strsplit(PATHWAY_STUDY_IDS, ",")[[1]])
print(paste("Using", PATHWAY_STUDY_IDS, "as Study Accession"))

Using Study Accession = MGYS00006180,MGYS00006178 from the link you followed.
Using "MGYS00006180,MGYS00006178" as Study Accession
[1] "Using MGYS00006180 as Study Accession"
[2] "Using MGYS00006178 as Study Accession"

# Custom Pathways Selection

# Most general pathways include
# Glycolysis / Gluconeogenesis, Citrate cycle (TCA cycle), Pentose phosphate pathway, 
# Fatty acid biosynthesis, Pyrimidine metabolism, Oxidative phosphorylation

CUSTOM_PATHWAY_IDS <- PathwaysSelection()

Pathways Selection :

For the most general & most complete pathways, input ‘G’
Press Enter to generate the most complete pathways
To add custom pathways, input pathway numbers (ex: 00053,01220)

Using Pathways Accession = G from the link you followed.
Using "G" as Pathways Accession


Using 00010  -  Glycolysis / Gluconeogenesis  :  https://www.kegg.jp/pathway/map00010 as a Custom Pathway
Using 00020  -  Citrate cycle (TCA cycle)  :  https://www.kegg.jp/pathway/map00020 as a Custom Pathway
Using 00030  -  Pentose phosphate pathway  :  https://www.kegg.jp/pathway/map00030 as a Custom Pathway
Using 00061  -  Fatty acid biosynthesis  :  https://www.kegg.jp/pathway/map00061 as a Custom Pathway
Using 01232  -  Nucleotide metabolism  :  https://www.kegg.jp/pathway/map01232 as a Custom Pathway
Using 00240  -  Pyrimidine metabolism  :  https://www.kegg.jp/pathway/map00240 as a Custom Pathway
Using 00190  -  Oxidative phosphorylation  :  https://www.kegg.jp/pathway/map00190 as a Custom Pathway

Fetch the analysis accession list using the study accessions.

output <- capture.output({
  all_accessions <- mgnify_analyses_from_studies(mg, PATHWAY_STUDY_IDS)
  all_metadata <- mgnify_get_analyses_metadata(mg, all_accessions)
})

Keeping just the first analysis accession, fetch the KEGG Orthologs count table from the MGnify API. Transform it from JSON into a matrix.

accession = head(all_accessions, 1)
ko_loc = paste0('analyses/',accession,'/kegg-orthologs')

ko_json = mgnify_retrieve_json(mg, path = ko_loc)
ko_data = as.data.frame(ko_json %>% spread_all)[ , c("attributes.accession", "attributes.count")]
ko_data = data.frame(ko_data, row.names=1)
colnames(ko_data)[1] = 'counts'
ko_matrix = data.matrix(ko_data)
head(ko_matrix, 3)

A matrix: 3 × 1 of type dbl
	counts
K10822	115
K02030	91
K02004	79

Fetch the KEGG Modules completeness table and filter out modules with completeness < 100%.

comp_loc = paste0('analyses/',accession,'/kegg-modules')
ko_comp_json = mgnify_retrieve_json(mg, path = comp_loc)
ko_comp = as.data.frame(ko_comp_json %>% spread_all)
modules = ko_comp[ko_comp$attributes.completeness == 100,][, c("attributes.accession")]
head(modules)

'M00001'
'M00002'
'M00003'
'M00004'
'M00005'
'M00006'

1.2. Select the most complete pathways

# Setting up function that collects KEGG pathways for a given list of IDs, excluding chemical structure & global maps

collect_pathways <- function(ids_list) {
    pathways = list()
    for (id in ids_list) { 
        current_pathway = as.list(keggLink("pathway", id))
        for (index in grep("map", current_pathway)) {        
            clean_id = gsub("*path:", "", current_pathway[index])
            # Discarding chemical structure (map010XX), global (map011XX), and overview (map012XX) maps
            prefix = substring(clean_id, 1, 6)
            if(is.na(match("map010", prefix)) & is.na(match("map011", prefix)) & is.na(match("map012", prefix)) ){
                pathways = append(pathways, clean_id)
            }
        }
    }
    return(pathways)
}

Now we need to collect the list of template pathways where these complete modules can be drawn. This step takes less than 1 minute to run.

md_pathways = collect_pathways(modules)
head(md_pathways)

'map00010'
'map00010'
'map00010'
'map00020'
'map00030'
'map00030'

In order to draw the most complete pathways maps, we will use the list of templates we obtained in the previous step and select only pathways having all their constituent modules.

# Counting the number of modules we have in each pathway
our_pathways_counts = list()
for (path_element in md_pathways) {
    if (path_element %in% names(our_pathways_counts)) {
        new_value = our_pathways_counts[[path_element]] + 1
        our_pathways_counts[path_element] = new_value       
    } else {
        our_pathways_counts[path_element] = 1 
    }
}

# Counting the number of modules expected in each pathway
u_pathways = unique(md_pathways)
exp_pathways_counts = list()
for (path in u_pathways) {
    mod_count = length(as.list(keggLink("module", path)))
    exp_pathways_counts[path] = mod_count 
}

# Selecting the pathways having all their constituent modules. We remove the 'map' prefix as pathview doesn't like it
to_draw = list()
for (pathway in names(our_pathways_counts)) {
    our_value = our_pathways_counts[[pathway]]
    exp_value = exp_pathways_counts[[pathway]]
    ratio =  our_value / exp_value
    if (ratio == 1) {
        nude_id =  gsub("map", "", pathway)
        to_draw = append(to_draw, nude_id)   
    }
}


# Adding the custom pathways to to_draw if not present already
for (pathway in CUSTOM_PATHWAY_IDS){
    if (!(pathway %in% to_draw)) {
    to_draw = append(to_draw, pathway)
        }
    }

# printing name of the pathways to be drawn
for (pathway in to_draw){
    print(paste(pathway, "-->", get_pathway_info(pathway)[1]," : ",get_pathway_info(pathway)[2], sep=" "))
}

[1] "00010 --> Glycolysis / Gluconeogenesis  :  https://www.kegg.jp/pathway/map00010"
[1] "00480 --> Glutathione metabolism  :  https://www.kegg.jp/pathway/map00480"
[1] "00430 --> Taurine and hypotaurine metabolism  :  https://www.kegg.jp/pathway/map00430"
[1] "00521 --> Streptomycin biosynthesis  :  https://www.kegg.jp/pathway/map00521"
[1] "00020 --> Citrate cycle (TCA cycle)  :  https://www.kegg.jp/pathway/map00020"
[1] "00030 --> Pentose phosphate pathway  :  https://www.kegg.jp/pathway/map00030"
[1] "00061 --> Fatty acid biosynthesis  :  https://www.kegg.jp/pathway/map00061"
[1] "01232 --> Nucleotide metabolism  :  https://www.kegg.jp/pathway/map01232"
[1] "00240 --> Pyrimidine metabolism  :  https://www.kegg.jp/pathway/map00240"
[1] "00190 --> Oxidative phosphorylation  :  https://www.kegg.jp/pathway/map00190"

1.3. Ready to draw!

As we are plotting absence/presence, we set the number of bins = 2, the scale in one direction, and use 1 as limit.

This can take a couple of minutes depending on the number of pathways.

suppressMessages({
for (p in to_draw) {
    pathview(gene.data = ko_matrix, 
             species = "ko", 
             pathway.id = p, 
             bins=c(2, 2), 
             both.dirs = FALSE, 
             limit = c(1,1), 
             mid = c("#ffffff" , "#ffffff"), 
             high = c("#02b3ad" , "#02b3ad")
    )
}
})

Clear the current working directory, and display all the generated figures that are stored at the pathway_plots/ directory.

generatePathwayPlots()

Glycolysis / Gluconeogenesis

Citrate cycle (TCA cycle)

Pentose phosphate pathway

Fatty acid biosynthesis

Oxidative phosphorylation

Pyrimidine metabolism

Taurine and hypotaurine metabolism

Glutathione metabolism

Streptomycin biosynthesis

Nucleotide metabolism

Resources and Documentation

MGnify:

https://docs.mgnify.org/src/docs/analysis.html#assembly-analysis-pipeline - Description of the latest MGnify analysis pipeline and the tools it uses.
https://academic.oup.com/nar/article/48/D1/D570/5614179 - This paper highlights the platform’s functionalities for metagenomic data assembly and analysis. It also describes he implementation of KEGG modules in MGnify.

Pathview:

The official documentation for Pathview, a tool for pathway-based data integration and visualization. It provides an overview of the tool and explains how to use it effectively.
You can find more information at https://pathview.uncc.edu/overview#kegg_view .

KEGGREST

KEGGREST is a Bioconductor package that provides programmatic access to the KEGG database.
More about KEGGREST at http://www.bioconductor.org/packages/release/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html

Pathways Explorer

Contents

Help with MGnifyR

MGnifyR Command cheat sheet

Draw presence/absence KOs for one metagenomic sample

1.1. Fetch data from MGnify & Pathways Selection

Pathways Selection :

1.2. Select the most complete pathways

1.3. Ready to draw!

Resources and Documentation

MGnify:

Pathview:

KEGGREST